Capstone Project - Automatic Ticketing Statement

Problem Statement

  1. Pre-Processing, Data Visualisation and EDA
  2. Model Building
  3. Test the Model, Fine-tuning and Repeat

Mounting the Google Drive

Importing the Libraries

Maintaining the Project and Dataset Path
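The mounting and path setup above can be sketched as follows. This is a minimal sketch that assumes a Colab runtime; the project folder and dataset filename (`capstone`, `input_data.xlsx`) are hypothetical placeholders, not names from the original notebook.

```python
import os

# In a Colab runtime, mount Google Drive; elsewhere fall back to the
# current working directory so the sketch still runs.
try:
    from google.colab import drive  # available only inside Google Colab
    drive.mount("/content/drive")
    PROJECT_PATH = "/content/drive/MyDrive/capstone"  # assumed folder name
except ImportError:
    PROJECT_PATH = os.getcwd()

# Hypothetical dataset filename -- adjust to the actual file.
DATASET_PATH = os.path.join(PROJECT_PATH, "input_data.xlsx")
print(PROJECT_PATH, DATASET_PATH)
```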

PART 1 - Pre-Processing, Data Visualization and EDA

Displaying first 5 records of Dataset

Displaying last 5 records of Dataset

Printing information about the data

Printing description of the dataset with various summary and statistics

Finding out the NULL values in each column

Printing the summary of dataset after NULL treatment
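The inspection steps above (head, tail, info, describe, and per-column NULL counts) can be sketched with pandas. The toy frame and its column names are assumptions standing in for the real ticket dataset:

```python
import pandas as pd

# Toy frame standing in for the ticket dataset (column names are assumed).
df = pd.DataFrame({
    "Short description": ["login issue", "outlook down", None,
                          "vpn error", "printer jam", "login issue"],
    "Caller": ["user_1", "user_2", "user_3", None, "user_5", "user_1"],
    "Assignment group": ["GRP_0", "GRP_1", "GRP_0", "GRP_2", "GRP_1", "GRP_0"],
})

print(df.head())          # first 5 records
print(df.tail())          # last 5 records
df.info()                 # dtypes and non-null counts per column
print(df.describe())      # summary statistics (count, unique, top, freq for text columns)
print(df.isnull().sum())  # NULL count per column
```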

Finding the duplicates dataset

Removing Duplicates in the dataset
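Finding and removing duplicates can be done with `duplicated()` and `drop_duplicates()`; a minimal sketch on an assumed toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Short description": ["login issue", "vpn error", "login issue"],
    "Caller": ["user_1", "user_2", "user_1"],
})

n_dupes = df.duplicated().sum()  # rows identical to an earlier row
print(f"duplicate rows: {n_dupes}")

df = df.drop_duplicates().reset_index(drop=True)
print(len(df))
```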

We can address NULL/missing values in the dataset in a variety of ways, including dropping the affected rows or columns, imputing with a statistic such as the mean, median, or mode, or filling with a constant placeholder value.
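Two common treatments can be sketched with pandas (toy frame and column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "Short description": ["login issue", None, "vpn error"],
    "Description": ["cannot log in", "outlook crashes", None],
})

# Option 1: drop every row with any missing value.
dropped = df.dropna()

# Option 2: fill missing text with an empty string, keeping every row.
filled = df.fillna("")

print(len(dropped), len(filled))
```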

Mojibake

Mojibake is the garbled text that is the result of text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.

This display may include the generic replacement character ("�") where the binary representation is considered invalid. A replacement can also involve multiple consecutive symbols, as viewed in one encoding, when the same binary code constitutes one symbol in the other encoding. This is caused either by differing constant-length encodings (as in Asian 16-bit encodings versus European 8-bit encodings) or by the use of variable-length encodings (notably UTF-8 and UTF-16). A few characters typical of such mojibake are ¶, ç, å, €, æ, œ, º, ‡, ¼, ¥, etc.
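The wrong-codec round trip described above can be reproduced, and reversed, with the standard library alone:

```python
# Mojibake arises when UTF-8 bytes are decoded with the wrong codec.
text = "café"
garbled = text.encode("utf-8").decode("latin-1")  # wrong decoder
print(garbled)  # 'cafÃ©'

# Reversing the mistaken round trip recovers the original text --
# this is, in simplified form, what ftfy automates and generalizes.
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)  # 'café'
```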

As we're dealing with natural language and the source of the data is unknown to us, let's run an encoding check to determine whether the dataset is affected by mojibake.

The library ftfy ("fixes text for you") is well suited to detecting and fixing such mojibake. It repairs Unicode that's broken in various ways: the goal of ftfy is to take in bad Unicode and output good Unicode.

Comments:

Language Translation (Goslate: Free Google Translate API)

Goslate is an open-source Python library that implements the Google Translate API. It uses the Google Translate Ajax API to call methods such as detect and translate. It was chosen over the alternative library Googletrans because Google employs a ticketing mechanism to prevent simple crawler programs from accessing the Ajax API; Goslate, configured with multiple service URLs, is able to translate the entire dataset in very few iterations without the user's IP address being blocked.

Comments:

Unless the paid service is used, Google blocks repetitive hits to its Ajax API, whether via Googletrans or Goslate, after a certain number of iterations by blocking the originating IP address. Using this list of translation-API domains as service URLs spreads the traffic among them, allowing a longer buffer before the IP gets blocked.

Text Preprocessing

Text preprocessing is the process of transforming text from human language into a machine-readable format for further processing. After a text is obtained, we start with text normalization. Text normalization includes converting text to lowercase, removing punctuation and special characters, tokenization, stopword removal, and stemming or lemmatization.
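The normalization steps above can be sketched with the standard library alone; in practice a library such as NLTK or spaCy would supply a proper tokenizer and stopword list, so the tiny stopword set here is purely illustrative:

```python
import re

STOPWORDS = {"the", "is", "a", "an", "of", "to", "in"}  # tiny illustrative set

def normalize(text):
    """Lowercase, strip punctuation/digits, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters only
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

print(normalize("The printer in Room 3 is NOT working!"))
# → ['printer', 'room', 'not', 'working']
```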

Stemming and Lemmatization

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a dataset, uncover its underlying structure, extract important variables, and detect outliers and anomalies.

Visually representing the content of a text document is one of the most important tasks in the field of text mining. It not only helps to explore the content of documents from different aspects and at different levels of detail, but also helps to summarize a single document, show its words and topics, detect events, and create storylines.

We'll be using the plotly library to generate the graphs and visualizations. We need cufflinks to link plotly to pandas DataFrames and add the iplot method.

Univariate visualization

Single-variable or univariate visualization is the simplest type of visualization, consisting of observations on only a single characteristic or attribute. Univariate visualizations include histograms, bar plots, and line charts.
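The counts behind such bar plots and histograms can be computed with pandas before handing them to plotly; a minimal sketch with an assumed toy frame and column names:

```python
import pandas as pd

df = pd.DataFrame({
    "Caller": ["user_1", "user_2", "user_1", "user_3", "user_1"],
    "Short description": ["login issue", "vpn error", None,
                          "printer jam", "outlook down"],
})

caller_counts = df["Caller"].value_counts()       # input to a bar plot
desc_lengths = df["Short description"].str.len()  # input to a histogram
print(caller_counts)
print(desc_lengths.describe())
```

With cufflinks wired up, a series like `caller_counts` can then be rendered interactively via its `iplot` method.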

Comments:

The distribution of Callers

Comments:

The distribution of Short description lengths

The distribution of Description lengths

Comments:

Model Building

Let's proceed to try the different model architectures mentioned below to classify the tickets and validate which one performs best.

Let's create another column of categorical datatype from Assignment groups. Let's write some generic methods for utilities and to plot evaluation metrics.
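Deriving a categorical column (and integer codes a model can consume) from the assignment groups can be sketched as follows; the group labels here are assumed examples:

```python
import pandas as pd

df = pd.DataFrame({"Assignment group": ["GRP_0", "GRP_12", "GRP_0", "GRP_3"]})

# Categorical dtype plus integer codes for the target variable.
df["Assignment group cat"] = df["Assignment group"].astype("category")
df["target"] = df["Assignment group cat"].cat.codes
print(df)
```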

Naive Bayes Classifier

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
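A naive Bayes text classifier of the kind described above can be sketched with scikit-learn; the four-ticket corpus and group names below are invented stand-ins for the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; real features come from the ticket descriptions.
texts = [
    "cannot reset my password",
    "password expired please reset",
    "outlook keeps crashing on startup",
    "email client will not open",
]
labels = ["GRP_ACCESS", "GRP_ACCESS", "GRP_EMAIL", "GRP_EMAIL"]

# TF-IDF features feeding a multinomial naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["forgot my password"]))
```

The conditional-independence assumption means each word contributes to the class score separately, which is why the pipeline trains in a single pass over the TF-IDF matrix.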

Advantages:

Disadvantages:

K-nearest Neighbor

Support Vector Machine (SVM)

Decision Tree

Random Forest

Observations:

We'll fine-tune the models and reduce overfitting in the next iteration.

Neural Network

Deep Neural Networks

Extract GloVe Embeddings

Recurrent Neural Networks (RNN)

Recurrent Convolutional Neural Network (RCNN)

RNN with LSTM networks